[Feat] [history server] Add node endpoint #4436
JiangJiaWei1103 wants to merge 50 commits into ray-project:master
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
    @@ -0,0 +1,52 @@
    package eventserver
This file defines the interface and processing helpers (e.g., MergeStateTransitions) that can be reused across different events' state transitions.
    // injectCollectorRayClusterID injects the ray-cluster-id argument into all collector containers.
    func injectCollectorRayClusterID(containers []corev1.Container, rayClusterID string) {
        for i := range containers {
            if containers[i].Name == "collector" {
                containers[i].Command = append(
                    containers[i].Command,
                    fmt.Sprintf("--ray-cluster-id=%s", rayClusterID),
                )
            }
        }
    }
This patch is proposed by @machichima in this PR.
      body, err := io.ReadAll(resp.Body)
      gg.Expect(err).NotTo(HaveOccurred())
    - gg.Expect(resp.StatusCode).To(Equal(200),
    + gg.Expect(resp.StatusCode).To(Equal(http.StatusOK),
Use the named HTTP status constant to preserve semantics instead of hardcoding the numeric code.
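For illustration, a minimal sketch of comparing against the named constants from `net/http` rather than numeric literals. The `nodesHandler` and `statusFor` names are hypothetical stand-ins, not the history server's actual handler or test helpers.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// nodesHandler is a hypothetical stand-in for an endpoint handler.
// It answers GET with 200 and rejects every other method with 405.
func nodesHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodGet {
		w.WriteHeader(http.StatusMethodNotAllowed)
		return
	}
	w.WriteHeader(http.StatusOK)
	fmt.Fprint(w, `{"result": true}`)
}

// statusFor runs the handler in memory (no network) and returns the
// status code it recorded.
func statusFor(method string) int {
	req := httptest.NewRequest(method, "/nodes", nil)
	rec := httptest.NewRecorder()
	nodesHandler(rec, req)
	return rec.Code
}

func main() {
	// Comparing against named constants keeps the intent readable.
	fmt.Println(statusFor(http.MethodGet) == http.StatusOK)
	fmt.Println(statusFor(http.MethodPost) == http.StatusMethodNotAllowed)
}
```

The named constants carry the same integer values as the literals, so the assertion behaves identically; only the readability changes.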
    // isLive indicates whether the response is from a live cluster or a dead cluster:
    // - isLive: true for a live cluster (current snapshot)
    // - isLive: false for a dead cluster (historical replay)
    func verifyNodesRespSchema(test Test, g *WithT, nodesResp map[string]any, isLive bool) {
Since each endpoint will have its own schema verification, we should extract these helpers into separate files to avoid cluttering the main test file.
    func ConvertBase64ToHex(input string) (string, error) {
        bytes, err := base64.StdEncoding.DecodeString(input)
        if err != nil {
            return "", err
        }

        hexStr := hex.EncodeToString(bytes)

        return hexStr, nil
    }
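A short usage sketch for the helper above. The function body is copied from the diff; the sample inputs are made up for illustration and do not come from real Ray events.

```go
package main

import (
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

// ConvertBase64ToHex decodes a standard base64 string and re-encodes the
// raw bytes as lowercase hexadecimal.
func ConvertBase64ToHex(input string) (string, error) {
	bytes, err := base64.StdEncoding.DecodeString(input)
	if err != nil {
		return "", err
	}

	hexStr := hex.EncodeToString(bytes)

	return hexStr, nil
}

func main() {
	// "AQID" is the base64 encoding of the bytes 0x01 0x02 0x03.
	hexStr, err := ConvertBase64ToHex("AQID")
	if err != nil {
		panic(err)
	}
	fmt.Println(hexStr) // 010203

	// Input outside the base64 alphabet is reported as an error.
	_, err = ConvertBase64ToHex("not base64!")
	fmt.Println(err != nil) // true
}
```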
This reverts commit 8d7ff9a.
…ay-project#4464) Signed-off-by: Future-Outlier <eric901201@gmail.com>
    // constructResourceString constructs a resource string based on the resources in state transition.
    // Ref: https://github.com/ray-project/ray/blob/f953f199b5d68d47c07c865c5ebcd2333d49f365/python/ray/autoscaler/_private/util.py#L643-L665.
    func constructResourceString(resources map[string]float64) string {
        resourceKeys := make([]string, 0, len(resources))
        for k := range resources {
            resourceKeys = append(resourceKeys, k)
        }
        sort.Strings(resourceKeys)

        resourceString := ""
        for _, k := range resourceKeys {
            v := resources[k]
Can we include other accelerator types here?
Can we write something like this?
    if k == "memory" || k == "object_store_memory" {
        formattedUsed := "0B"
        formattedTotal := formatMemory(v)
        resourceString += fmt.Sprintf("%s/%s %s", formattedUsed, formattedTotal, k)
    } else if strings.HasPrefix(k, "node:") {
        continue // Skip per-node resources
    } else if strings.HasPrefix(k, "accelerator_type:") {
        continue // Skip accelerator_type (Issue #33272)
    } else {
        // Handle CPU, GPU, TPU, and other resources
        resourceString += fmt.Sprintf("%.1f/%.1f %s", 0.0, v, k)
    }

        resourceString = strings.TrimSuffix(resourceString, "\n")

        return resourceString
    }
Resource string omits GPU and custom accelerator resources
Medium Severity
constructResourceString only formats "CPU", "memory", and "object_store_memory" resources, using continue to skip everything else including "GPU" and other accelerator types. For GPU-enabled clusters, the resource string will be missing GPU information entirely. The referenced Python implementation at util.py#L643-L665 likely formats all resource types, not just these three.
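One possible shape for a version that keeps GPU and other accelerator resources is sketched below. It follows the branching in the review suggestion above, with usage fixed at zero. The names `buildResourceString` and `formatMemory` are hypothetical, and the unit thresholds in `formatMemory` are an assumption rather than a port of the referenced Python code.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatMemory is a hypothetical helper that renders a byte count with a
// binary unit. The thresholds and precision here are assumptions.
func formatMemory(bytes float64) string {
	const kib = 1024.0
	switch {
	case bytes >= kib*kib*kib:
		return fmt.Sprintf("%.2fGiB", bytes/(kib*kib*kib))
	case bytes >= kib*kib:
		return fmt.Sprintf("%.2fMiB", bytes/(kib*kib))
	case bytes >= kib:
		return fmt.Sprintf("%.2fKiB", bytes/kib)
	default:
		return fmt.Sprintf("%.0fB", bytes)
	}
}

// buildResourceString formats every resource except per-node and
// accelerator_type markers, one resource per line, in sorted key order.
func buildResourceString(resources map[string]float64) string {
	keys := make([]string, 0, len(resources))
	for k := range resources {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	lines := make([]string, 0, len(keys))
	for _, k := range keys {
		v := resources[k]
		switch {
		case k == "memory" || k == "object_store_memory":
			lines = append(lines, fmt.Sprintf("0B/%s %s", formatMemory(v), k))
		case strings.HasPrefix(k, "node:"), strings.HasPrefix(k, "accelerator_type:"):
			continue // skip per-node and accelerator_type marker resources
		default:
			// CPU, GPU, TPU, and custom resources share one numeric format.
			lines = append(lines, fmt.Sprintf("%.1f/%.1f %s", 0.0, v, k))
		}
	}
	return strings.Join(lines, "\n")
}

func main() {
	fmt.Println(buildResourceString(map[string]float64{
		"CPU":                   4,
		"GPU":                   1,
		"memory":                2 * 1024 * 1024 * 1024,
		"node:10.0.0.1":         1,
		"accelerator_type:A100": 1,
	}))
}
```

Collecting lines in a slice and joining them at the end avoids the trailing-newline trim that string concatenation would need.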


Why are these changes needed?
This PR adds support for the `/nodes` and `/nodes/{node_id}` endpoints to the history server.
With these endpoints, the history server can reconstruct node state for terminated (dead) clusters, enabling post-mortem analysis through historical event replay. This includes node summaries and snapshots of node-level resources at different time points.
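To illustrate what reconstructing node state through event replay can look like, here is a minimal sketch. The `nodeEvent` type, the `replayNodeState` function, and the state strings are all hypothetical and simplified; they are not the types or logic introduced by this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// nodeEvent is a hypothetical, simplified lifecycle event for one node.
type nodeEvent struct {
	NodeID    string
	State     string // illustrative values such as "ALIVE" or "DEAD"
	Timestamp int64  // seconds since an arbitrary epoch
}

// replayNodeState applies events in timestamp order and returns the last
// known state of each node at or before the given point in time.
func replayNodeState(events []nodeEvent, at int64) map[string]string {
	// Copy before sorting so the caller's slice is left untouched.
	sorted := append([]nodeEvent(nil), events...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Timestamp < sorted[j].Timestamp
	})

	states := make(map[string]string)
	for _, e := range sorted {
		if e.Timestamp > at {
			break // later events are not part of this snapshot
		}
		states[e.NodeID] = e.State
	}
	return states
}

func main() {
	events := []nodeEvent{
		{NodeID: "n1", State: "DEAD", Timestamp: 30},
		{NodeID: "n1", State: "ALIVE", Timestamp: 10},
		{NodeID: "n2", State: "ALIVE", Timestamp: 20},
	}
	fmt.Println(replayNodeState(events, 25)["n1"]) // ALIVE
	fmt.Println(replayNodeState(events, 30)["n1"]) // DEAD
}
```

Replaying up to a chosen timestamp is what makes snapshots at different time points possible: the same event log yields a different view for each cutoff.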
Change Summary
At a high level, this PR introduces changes across two main layers:
History Server Layer
- `/nodes?view=summary` and `/nodes/{node_id}` endpoints
Event Server Layer
- `NODE_DEFINITION_EVENT` and `NODE_LIFECYCLE_EVENT`
- `ClusterNodeMap`, and expose a `GetNodeMap` helper for consumption by the history server layer
Test Result
Related issue number
Closes #4376.
Checks